BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to the validation of character code sequences. 

2. Discussion of the Related Art 

Double-byte character encoding is commonly used for a number of purposes, 
among them encoding complex character sets such as GB 2312-80, the simplified 
Chinese characters used in mainland China. GB 2312-80 contains 7,445 characters, 6,763 of 
them Chinese characters, each represented as a pair of bytes wherein each byte is a number from 161 to 254. 
This allows the mixing of the Chinese characters with conventional ASCII text, which is 
represented by byte values in the range of 0 to 127. Technically, the simultaneous 
representation of GB 2312-80 with ASCII is called EUC-CN encoding, though we refer 
to it as GB 2312-80 throughout this specification for simplicity. This necessarily implies 
that bytes in the range of 161 to 254 must come in pairs and any string of such characters 
must have an even number of such bytes in a row between any two single-byte ASCII 
characters. Byte values in the range of 128 to 160 are invalid for GB 2312-80. Despite 
these rules, invalid characters and sequences, collectively referred to as "noise", are found 
to occur in 5% to 10% of Chinese webpages and newswire texts. The origins of this 
noise are obscure. 
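
The byte-range rules above can be sketched as a simple structural check. This is an illustrative sketch only, not part of the disclosed method; the function name `find_structural_noise` is hypothetical, and the greedy left-to-right pairing is an assumption made for the illustration.

```python
def find_structural_noise(data: bytes) -> list[int]:
    """Return indices of bytes that break the byte-range rules described above.

    Assumption for this sketch: high bytes are paired greedily from left to
    right, so an odd run of high bytes always reports its LAST byte.
    """
    bad = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b <= 127:
            # Single-byte ASCII.
            i += 1
        elif 161 <= b <= 254:
            if i + 1 < n and 161 <= data[i + 1] <= 254:
                # Well-formed pair of high bytes.
                i += 2
            else:
                # High byte with no partner.
                bad.append(i)
                i += 1
        else:
            # 128..160 (and 255) never appear in valid text.
            bad.append(i)
            i += 1
    return bad


print(find_structural_noise(bytes([65, 181, 231, 140, 66])))
```

Such a check can report that an odd run of high bytes is malformed, but it always blames the last byte of the run, which is not necessarily the byte that was actually corrupted.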

Applications currently available for the processing of double-byte encodings are 
inadequate to cope with noise. For example, GB to Unicode converters simply crash on 
the first invalid byte sequence and all information following the noise is lost. 

Repairing such noise presents a problem of ambiguity. For example, consider 
the case of a nine-byte sequence of GB 2312-80 bytes, all in the range of 161 to 254: 
which "half character" is the noise to be discarded? Discarding any one of the bytes will 
likely leave four perfectly valid Chinese characters, but in an incomprehensible sequence. 
In all probability, only one of the bytes may be discarded so as to produce an intelligible 
string of characters. 
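
The ambiguity can be made concrete with a short sketch that drops each byte of a nine-byte run in turn and decodes the remainder. The byte values are those of the worked example given later in this specification; the helper name `single_deletions` is hypothetical, and Python's built-in `gb2312` codec stands in for a converter.

```python
# Example bytes from this specification: four valid two-byte characters.
good = bytes([181, 231, 202, 211, 189, 218, 196, 191])
# The same bytes with a stray byte (decimal 189) inserted at the beginning.
noisy = bytes([189]) + good


def single_deletions(data: bytes) -> list[tuple[int, str]]:
    """Decode every string obtained by discarding exactly one byte."""
    results = []
    for k in range(len(data)):
        candidate = data[:k] + data[k + 1:]
        # errors="replace" keeps going when a pair falls in a gap of the set.
        results.append((k, candidate.decode("gb2312", errors="replace")))
    return results


candidates = single_deletions(noisy)
for index, text in candidates:
    print(index, text)
```

Every candidate has an even number of high bytes, so byte-range rules alone cannot choose among them; only the candidate that discards the first byte reproduces the original text.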

What is needed is a method of validating strings of double-byte characters to 
detect and remove such noise. 

SUMMARY OF THE INVENTION 

Disclosed is a method of validating a byte sequence having a plurality of states, 
the method comprising designating one or more noise states from among the plurality of 
states, generating a most probable state sequence for the byte sequence, utilizing said 
state sequence to identify all noise in the byte sequence, and localizing said noise in said 
noise states. 

In another aspect of the invention, the noise is then deleted from the byte 
sequence. 

In another aspect of the invention, the noise state is an ASCII state. 

In another aspect of the invention, said generating of a most probable state 
sequence comprises calculating P(X_0 ... X_n | S_0 ... S_n), representing the conditional 
probabilities of said byte sequence given a state sequence. 

In another aspect of the invention, said calculating P(X_0 ... X_n | S_0 ... S_n) 
comprises assigning a state label S_i to each i-th byte X_i of the byte sequence so as to 
maximize the equation: 

\[
P(X_0 \ldots X_n \mid S_0 \ldots S_n) = P_0(S_0) \prod_{i=0}^{n} A(S_i \mid S_{i-1})\, B(X_i \mid S_i)
\]

wherein P_0(S_0) is the initial distribution of states; A(S_i | S_{i-1}) is a "state-to-state" 
transmission matrix; and B(X_i | S_i) is a "byte-from-state" matrix of the probabilities of 
generating a byte value X_i given a state S_i. 
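
In another aspect of the invention, 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
p(1 \to 1) & \cdots & p(1 \to a) \\
\vdots & \ddots & \vdots \\
p(a \to 1) & \cdots & p(a \to a)
\end{pmatrix}
\]

where, with the states numbered 1 through a, each p(S_{i-1} → S_i) is the probability that a particular S_i state immediately follows an 
S_{i-1} state in a valid byte sequence having a states. 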

In another aspect of the invention, 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
p(A \to A) & p(A \to GB1) & p(A \to GB2) \\
p(GB1 \to A) & p(GB1 \to GB1) & p(GB1 \to GB2) \\
p(GB2 \to A) & p(GB2 \to GB1) & p(GB2 \to GB2)
\end{pmatrix}
\]

where each p(S_{i-1} → S_i) is the probability that a particular S_i state immediately follows an 
S_{i-1} state in a valid byte sequence having three states. 

In another aspect of the invention, 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
0.995157 & 0.004843 & 0 \\
0 & 0 & 1 \\
0.037944 & 0.962056 & 0
\end{pmatrix}
\]

and said valid byte sequence is valid text in the GB 2312-80 character set. 
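
In another aspect of the invention, 

\[
B(X_i \mid S_i) =
\begin{pmatrix}
h_1(X_i = 0) & \cdots & h_s(X_i = 0) & \cdots & h_a(X_i = 0) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = x_1) & \cdots & h_s(X_i = x_1) & \cdots & h_a(X_i = x_1) \\
h_1(X_i = x_1 + 1) & \cdots & \varepsilon_1(X_i = x_1 + 1) & \cdots & h_a(X_i = x_1 + 1) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = x_r + 1) & \cdots & \varepsilon_r(X_i = x_r + 1) & \cdots & h_a(X_i = x_r + 1) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = 255) & \cdots & \varepsilon_r(X_i = 255) & \cdots & h_a(X_i = 255)
\end{pmatrix}
\]
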
where h_s(X_i) are histogram functions of the a states and ε_j(X_i) are probabilities of 
associating noise with the noise state for bytes within r+1 ranges of byte values X_i. 

In another aspect of the invention, 

\[
B(X_i \mid S_i) =
\begin{pmatrix}
h_A(X_i = 0) & 0 & 0 \\
\vdots & \vdots & \vdots \\
h_A(X_i = 127) & 0 & 0 \\
\varepsilon_1(X_i = 128) & 0 & 0 \\
\vdots & \vdots & \vdots \\
\varepsilon_1(X_i = 160) & 0 & 0 \\
\varepsilon_2(X_i = 161) & h_1(X_i = 161) & h_2(X_i = 161) \\
\vdots & \vdots & \vdots \\
\varepsilon_2(X_i = 254) & h_1(X_i = 254) & h_2(X_i = 254) \\
\varepsilon_3(X_i = 255) & 0 & 0
\end{pmatrix}
\]

where h_s(X_i) are histogram functions of the states, and ε_j(X_i) are probabilities of 
associating noise with the ASCII state within a plurality of X_i ranges for a three-state 
byte sequence. 

Another aspect of the invention further comprises providing a valid three-state 
byte sequence having an ASCII state and comprising valid ASCII and two-byte 
characters; computing an ASCII histogram h_A(X_i) by a method comprising sampling 
valid ASCII text so as to measure the frequency of occurrence of each byte value, 
computing an unnormalized histogram of said sampling over the ASCII range of X_i, and 
normalizing said unnormalized histogram such that the entire column of the matrix 
containing said ASCII histogram sums to 1; computing a first-byte histogram 
h_1(X_i) by sampling valid two-byte text, computing the unnormalized first-byte 
histogram over the odd bytes, and normalizing said first-byte histogram such that the 
entire column of the matrix containing said first-byte histogram sums to 1; and 
computing a second-byte histogram h_2(X_i) by sampling valid two-byte text, 
computing the unnormalized second-byte histogram over the even bytes, and normalizing 
said second-byte histogram such that the entire column of the matrix containing said 
second-byte histogram sums to 1. 

Disclosed is a program storage device readable by machine, tangibly embodying a 
program of instructions executable by the machine to perform method steps for 
validating a byte sequence having a plurality of states, said method comprising 
designating one or more noise states from among the plurality of states, generating a most 
probable state sequence for the byte sequence, utilizing said state sequence to identify all 
noise in the byte sequence, and localizing said noise in said noise states. 

In another aspect of the device, said localizing of said noise in said noise states 
comprises examining each byte in said byte sequence that does not correspond to a noise 
state, determining if the byte is valid, and if the byte is not valid, then redesignating the 
state of said byte to a noise state. 

In another aspect of the device, the device also comprises a lookup table of valid 
bytes, wherein said determination of whether a byte is valid is accomplished by accessing said 
lookup table. 

Disclosed is a method of validating a byte sequence having a plurality of states 
including an ASCII state, the method comprising selecting the ASCII state as the noise 
state; generating a most probable state sequence for the byte sequence by a method 
comprising calculating P(X_0 ... X_n | S_0 ... S_n), representing the conditional 
probabilities of said byte sequence given a state sequence, wherein said calculating 
P(X_0 ... X_n | S_0 ... S_n) comprises assigning a state label S_i to each i-th byte X_i of the byte 
sequence so as to maximize the equation: 

\[
P(X_0 \ldots X_n \mid S_0 \ldots S_n) = P_0(S_0) \prod_{i=0}^{n} A(S_i \mid S_{i-1})\, B(X_i \mid S_i)
\]

wherein P_0(S_0) is the initial distribution of states; A(S_i | S_{i-1}) is a "state-to-state" 
transmission matrix; and B(X_i | S_i) is a "byte-from-state" matrix of the probabilities of 
generating a byte value X_i given a state S_i; and utilizing said state sequence to identify all 
noise in the byte sequence, localizing said noise in said noise states, and deleting said 
noise from the byte sequence. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a diagram of the overall process of an embodiment of the invention. 
Figure 2 is a diagram of a typical Markov Model of the invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Referring to Figure 1, there is depicted the basic flow of the invention wherein a 
possibly corrupted byte sequence is passed to a state sequence labeler 100 that first 
generates a most probable state sequence for the byte sequence and then modifies the state 
sequence so as to localize all of the errors, or "noise", into a single state. The byte 
sequence and the associated state sequence are then passed to a repair module 110 that 
examines the sequences to determine whether there exist any errors in the byte sequence and, 
if so, corrects them, thereby outputting a valid byte sequence. 

Figure 2 depicts a typical Markov model for allowable state sequences for mixed 
double-byte and ASCII sequences, such as GB-type byte sequences. The state of a byte 
in this example can be one of three, namely an ASCII character (state A), a first byte of a 
two-byte character (state GB1), or a second byte of a two-byte character (state 
GB2). The states are designated by the user, dependent upon the noise the user wishes to 
detect, which in this example is invalid character codes. As can be seen in the diagram, a 
single-byte ASCII character (state A) can be followed by another ASCII character (state 
A) or by the first byte (state GB1) of a double-byte GB-type character, but never can an 
ASCII character (state A) be followed by the second byte (state GB2) of a double-byte 
GB-type character. This is shown by the directions of the arrows leading toward and 
away from the ASCII state A. Likewise, a first GB byte (state GB1) may be followed only by 
a second byte (state GB2), never by an ASCII character (state A) or by another first byte 
(state GB1); and a second byte (state GB2) may be followed by an ASCII character (state A) 
or by a first byte (state GB1), but never by another second byte (state GB2). A violation of 
these rules is not permitted in the state string generated by 
the state sequence labeler 100 of Figure 1, and this is mathematically guaranteed by the 
zero entries in the "state-to-state" transmission matrix A(S_i | S_{i-1}) of Equation 4c, more 
fully described below. 

Mathematically forcing a proper state sequence upon a corrupted byte sequence 
with the state-to-state matrix will result in invalid character codes; that is to say, 
there will be bytes labeled as ASCII state that do not correspond to any valid ASCII 
character and pairs of GB1 and GB2 bytes that do not correspond to any valid two-byte 
character. These invalid characters are detected and repaired by the repair module 110. 
As will be explained below, the repair module does this by "localizing" the invalid 
characters to one or more designated "noise states." 

The method utilized by the state sequence labeler to generate a state sequence for 
a particular byte sequence makes use of a probabilistic model. For any byte sequence: 

\[
X_0, X_1, X_2, \ldots, X_n \tag{1}
\]

we wish to generate a corresponding state sequence: 

\[
S_0, S_1, S_2, \ldots, S_n \tag{2}
\]

where each X_i is an integer from 0 to 255 (for eight-bit bytes) and each S_i is either A, 1, 
or 2 for single-byte ASCII bytes, first double-byte bytes, or second double-byte bytes, 
respectively. 

We may then model the conditional probabilities of the byte sequence of Equation 
1 given the state sequence of Equation 2 as: 

\[
P(X_0 \ldots X_n \mid S_0 \ldots S_n) = P_0(S_0) \prod_{i=0}^{n} A(S_i \mid S_{i-1})\, B(X_i \mid S_i) \tag{3}
\]

where P_0(S_0) is the initial distribution of states, namely P_0(A) = P_0(1) = P_0(2) = 1/3; 
A(S_i | S_{i-1}) is the state-to-state transmission matrix and will have the properties of the 
Markov model being utilized; and B(X_i | S_i) is the "byte-from-state" matrix of the 
probabilities of generating a byte value X_i given a state S_i. For i = 0 there is no preceding 
state, so the factor A(S_0 | S_{-1}) is taken as 1 and P_0(S_0) supplies the probability of the first state. 
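
The maximization of Equation 3 can be sketched with the standard Viterbi dynamic program, worked in log space. This is a minimal illustration and not necessarily how the disclosed labeler is implemented: the name `most_probable_states`, the flat "histograms" inside `emission`, and the value of `EPS` are all assumptions, whereas the transition values are those given in this specification for GB 2312-80.

```python
import math

STATES = ("A", "GB1", "GB2")
NEG_INF = float("-inf")

# State-to-state values for GB 2312-80 (rows: previous state A, GB1, GB2).
TRANS = (
    (0.995157, 0.004843, 0.0),
    (0.0, 0.0, 1.0),
    (0.037944, 0.962056, 0.0),
)

EPS = 1e-6  # assumed noise probability per out-of-range byte value


def emission(x: int, s: int) -> float:
    """Toy B(x | s): flat histograms, with the A column carrying the noise."""
    if s == 0:
        return (1.0 - 128 * EPS) / 128 if x <= 127 else EPS
    return 1.0 / 94 if 161 <= x <= 254 else 0.0


def safe_log(p: float) -> float:
    return math.log(p) if p > 0 else NEG_INF


def most_probable_states(data: bytes) -> list[str]:
    """Return the state labels that maximize Equation 3 for `data`."""
    if not data:
        return []
    k = len(STATES)
    # Initial distribution P0 = 1/3 for every state.
    score = [safe_log(1 / 3) + safe_log(emission(data[0], s)) for s in range(k)]
    back = []
    for x in data[1:]:
        prev, score, pointers = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda r: prev[r] + safe_log(TRANS[r][s]))
            pointers.append(best)
            score.append(prev[best] + safe_log(TRANS[best][s]) + safe_log(emission(x, s)))
        back.append(pointers)
    # Trace the best final state back to the start.
    s = max(range(k), key=lambda r: score[r])
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return [STATES[s] for s in path]


labels = most_probable_states(bytes([65, 181, 231, 140, 202, 211, 66]))
print(labels)
```

In this usage example, byte 140 lies in the range 128 to 160 that no two-byte state can generate, so the only finite-probability labeling places it in the ASCII state, which is where the noise is meant to be localized. With flat histograms the sketch cannot discriminate between equally structured alternatives; measured histograms would be needed for that.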

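For any byte sequence having a total of a states, the state-to-state matrix will be an 
a × a matrix which, with the states numbered 1 through a, has the form: 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
p(1 \to 1) & \cdots & p(1 \to a) \\
\vdots & \ddots & \vdots \\
p(a \to 1) & \cdots & p(a \to a)
\end{pmatrix}
\tag{4a}
\]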



The state-to-state matrix will have the following form in the case of the 
three-state Markov model shown in Figure 2: 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
p(A \to A) & p(A \to GB1) & p(A \to GB2) \\
p(GB1 \to A) & p(GB1 \to GB1) & p(GB1 \to GB2) \\
p(GB2 \to A) & p(GB2 \to GB1) & p(GB2 \to GB2)
\end{pmatrix}
\tag{4b}
\]

where p(S_{i-1} → S_i) indicates the probability, or observed frequency, that a particular S_i 
state for the i-th byte immediately follows a particular S_{i-1} state for the immediately 
preceding byte (i-1) in a valid byte sequence. It is to be understood that Equation 4b is 
the matrix for a three-state byte sequence. When applied to the GB 2312-80 character 
set, the state-to-state matrix is found to be as follows: 

\[
A(S_i \mid S_{i-1}) =
\begin{pmatrix}
0.995157 & 0.004843 & 0 \\
0 & 0 & 1 \\
0.037944 & 0.962056 & 0
\end{pmatrix}
\tag{4c}
\]



The significance of the matrix entries of Equations 4b and 4c may also be illustrated by the 
following table: 

TABLE 1 

                  S_i = A     S_i = GB1   S_i = GB2
S_{i-1} = A       0.995157    0.004843    0
S_{i-1} = GB1     0           0           1
S_{i-1} = GB2     0.037944    0.962056    0

From Table 1 we see, for example, that the probability of an ASCII character (i.e., 
state A) being followed by another ASCII character is p(A → A) = 0.995157, while the probability 
of an ASCII character being followed by the first byte of a two-byte character is only p(A → GB1) 
= 0.004843. The numbers in Table 1 represent what the probabilities should be for valid 
text, and are derived from compiling statistics of actual valid text. 
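
As a quick illustration of how the entries of Table 1 are used, the following sketch stores the matrix row by row and multiplies the transition probabilities along a candidate state sequence. The helper names `allowed` and `path_probability` are hypothetical.

```python
STATES = ("A", "GB1", "GB2")

# Table 1: rows are the previous state, columns are the current state.
A = [
    [0.995157, 0.004843, 0.0],   # from A
    [0.0,      0.0,      1.0],   # from GB1
    [0.037944, 0.962056, 0.0],   # from GB2
]


def allowed(prev: str, cur: str) -> bool:
    """True when the model permits `cur` to follow `prev`."""
    return A[STATES.index(prev)][STATES.index(cur)] > 0.0


def path_probability(states: list[str]) -> float:
    """Product of transition probabilities along a state sequence."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= A[STATES.index(prev)][STATES.index(cur)]
    return p


print(allowed("A", "GB2"), allowed("GB2", "GB1"))
print(path_probability(["A", "GB1", "GB2", "A"]))
```

Any state sequence containing a forbidden transition receives probability zero, which is how the zero entries exclude improper sequences from the maximization.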

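The B(X_i | S_i) matrix is described by: 

\[
B(X_i \mid S_i) =
\begin{pmatrix}
h_1(X_i = 0) & \cdots & h_s(X_i = 0) & \cdots & h_a(X_i = 0) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = x_1) & \cdots & h_s(X_i = x_1) & \cdots & h_a(X_i = x_1) \\
h_1(X_i = x_1 + 1) & \cdots & \varepsilon_1(X_i = x_1 + 1) & \cdots & h_a(X_i = x_1 + 1) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = x_r + 1) & \cdots & \varepsilon_r(X_i = x_r + 1) & \cdots & h_a(X_i = x_r + 1) \\
\vdots & & \vdots & & \vdots \\
h_1(X_i = 255) & \cdots & \varepsilon_r(X_i = 255) & \cdots & h_a(X_i = 255)
\end{pmatrix}
\tag{5a}
\]

where h_s(X_i) are histogram functions of the states and ε_j(X_i) are probabilities of 
associating noise with the s-th state within r+1 X_i ranges. This indicates that the s-th state 
has been selected as the "noise state" in accordance with the invention, as is more fully 
described below. 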

For the three-state Markov model of Figure 2, Equation 5a becomes: 

\[
B(X_i \mid S_i) =
\begin{pmatrix}
h_A(X_i = 0) & 0 & 0 \\
\vdots & \vdots & \vdots \\
h_A(X_i = 127) & 0 & 0 \\
\varepsilon_1(X_i = 128) & 0 & 0 \\
\vdots & \vdots & \vdots \\
\varepsilon_1(X_i = 160) & 0 & 0 \\
\varepsilon_2(X_i = 161) & h_1(X_i = 161) & h_2(X_i = 161) \\
\vdots & \vdots & \vdots \\
\varepsilon_2(X_i = 254) & h_1(X_i = 254) & h_2(X_i = 254) \\
\varepsilon_3(X_i = 255) & 0 & 0
\end{pmatrix}
\tag{5b}
\]

where h_A(X_i) is computed by sampling valid ASCII text, computing the histogram of its 
bytes, and then normalizing it so that the entire column of the matrix (including the 
epsilons) sums to 1. First-byte histogram h_1(X_i) is computed by sampling valid two-byte 
text and computing the histogram over the odd bytes, while the second-byte histogram 
h_2(X_i) is computed over the even bytes. Histograms h_1(X_i) and h_2(X_i) are also then 
normalized to cause their respective columns to sum to 1. Notice that the above matrix 
implies that the ASCII state is the designated noise state. 
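
The construction of the three-column matrix can be sketched as follows. The function name `build_emission_matrix`, the tiny sample texts, and the epsilon values are assumptions for illustration; in practice the histograms would be measured from large bodies of valid text.

```python
def build_emission_matrix(ascii_sample: bytes, gb_sample: bytes,
                          eps=(1e-6, 1e-6, 1e-6)) -> list[list[float]]:
    """Return a 256 x 3 matrix B[x][s] with columns A, GB1, GB2.

    Assumption: gb_sample holds only valid two-byte characters, so its
    1st, 3rd, ... bytes are first bytes and its 2nd, 4th, ... are second bytes.
    """
    counts_a = [0] * 256
    counts_1 = [0] * 256
    counts_2 = [0] * 256
    for b in ascii_sample:
        if b <= 127:
            counts_a[b] += 1
    for first, second in zip(gb_sample[0::2], gb_sample[1::2]):
        counts_1[first] += 1
        counts_2[second] += 1
    total_a, total_1, total_2 = sum(counts_a), sum(counts_1), sum(counts_2)
    if 0 in (total_a, total_1, total_2):
        raise ValueError("both samples must be non-empty")

    eps1, eps2, eps3 = eps
    # 33 byte values in 128..160, 94 in 161..254, and the single value 255.
    noise_mass = 33 * eps1 + 94 * eps2 + eps3

    B = [[0.0, 0.0, 0.0] for _ in range(256)]
    for x in range(256):
        if x <= 127:
            # Scale h_A so the whole A column, epsilons included, sums to 1.
            B[x][0] = (1.0 - noise_mass) * counts_a[x] / total_a
        elif x <= 160:
            B[x][0] = eps1
        elif x <= 254:
            B[x][0] = eps2
            B[x][1] = counts_1[x] / total_1
            B[x][2] = counts_2[x] / total_2
        else:
            B[x][0] = eps3
    return B


B = build_emission_matrix(b"hello world", "电视节目".encode("gb2312"))
column_sums = [sum(row[s] for row in B) for s in range(3)]
print(column_sums)
```

Only the ASCII column carries nonzero entries outside its own valid range, which is what makes the ASCII state the designated noise state in this arrangement.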

The significance of the B(X_i | S_i) matrix entries with respect to GB 2312-80 is 
described by the following table: 

TABLE 2 

                    S_i = A      S_i = GB1    S_i = GB2
0 ≤ X_i ≤ 127       h_A(X_i)     0            0
128 ≤ X_i ≤ 160     ε_1(X_i)     0            0
161 ≤ X_i ≤ 254     ε_2(X_i)     h_1(X_i)     h_2(X_i)
X_i = 255           ε_3(255)     0            0

where it can be seen, for example, that bytes associated with GB 2312-80 Chinese 
characters (i.e., 161 ≤ X_i ≤ 254) are most likely to have been generated by states GB1 and 
GB2, but have a small error probability, ε_2, of deriving from an ASCII character. 

To generate the state sequence for a given string of characters, Equation 3 is used 
to find the state sequence that maximizes the value of P(X_0 ... X_n | S_0 ... S_n). Then, 
all pairs of bytes labeled GB1 followed by GB2 are analyzed to confirm that they are 
valid GB characters. (The GB 2312-80 character set contains some "gaps" within the valid 
range of bytes, so a preferred embodiment for such "gapped" character sets is to do this 
checking by looking up the values in a lookup table.) If they do not form a valid 
character, both bytes are relabeled as the noise state, in this example as ASCII, in the state 
sequence. 
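
The lookup-table check can be sketched as follows. Building the table from Python's built-in `gb2312` codec is an assumption of this sketch, a convenient stand-in for a table compiled from the standard itself; the names `build_valid_pairs`, `VALID_PAIRS`, and `relabel_invalid_pairs` are hypothetical.

```python
def build_valid_pairs() -> set[tuple[int, int]]:
    """Collect every (first, second) byte pair the gb2312 codec can decode."""
    valid = set()
    for first in range(161, 255):
        for second in range(161, 255):
            try:
                bytes([first, second]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # the pair falls in a gap of the character set
            valid.add((first, second))
    return valid


VALID_PAIRS = build_valid_pairs()


def relabel_invalid_pairs(data: bytes, states: list[str],
                          noise_state: str = "A") -> list[str]:
    """Relabel both bytes of any GB1/GB2 pair that is not a valid character."""
    out = list(states)
    for i in range(len(data) - 1):
        if out[i] == "GB1" and out[i + 1] == "GB2":
            if (data[i], data[i + 1]) not in VALID_PAIRS:
                out[i] = out[i + 1] = noise_state
    return out


# Bytes 170, 161 lie in an unassigned row of the character set.
print(relabel_invalid_pairs(bytes([181, 231, 170, 161]),
                            ["GB1", "GB2", "GB1", "GB2"]))
```

A set of pairs gives constant-time membership tests; a 94 × 94 bit table would serve equally well where memory is a concern.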

The state sequence that results from the above operations is passed on to the 
repair module 110 and has the following properties: 

TABLE 3 

1. Any consecutive states GB1 and GB2 are guaranteed 
to correspond to valid two-byte characters. 

2. Any ASCII state A whose value X_i is 0 ≤ X_i ≤ 127 is 
valid. 

3. Any state A whose value X_i is not 0 ≤ X_i ≤ 127 may 
be regarded as invalid 

where it can be seen that all of the noise is now localized into one state, namely the 
ASCII state. This is by way of illustration only, because we may modify the procedure 
of the sequence labeler 100 so as to localize all of the noise into one of the other states. 

The repair module 110 simply identifies invalid ASCII characters in the sequence 
and deletes them. In another embodiment, the repair module 110 may also detect 
ambiguities in the string and make corrections by accessing a database of statistics of 
actual language samples. 
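
The deletion step can be sketched in a few lines. The function name `repair` is hypothetical; the byte values and state labels are those of the worked example given later in this specification, with ASCII as the noise state.

```python
def repair(data: bytes, states: list[str], noise_state: str = "A") -> bytes:
    """Drop every byte labeled with the noise state whose value is not 0..127."""
    kept = bytearray()
    for b, s in zip(data, states):
        if s == noise_state and not (0 <= b <= 127):
            continue  # invalid ASCII-state byte: this is the localized noise
        kept.append(b)
    return bytes(kept)


noisy = bytes([189, 181, 231, 202, 211, 189, 218, 196, 191])
states = ["A", "GB1", "GB2", "GB1", "GB2", "GB1", "GB2", "GB1", "GB2"]
repaired = repair(noisy, states)
print(list(repaired), repaired.decode("gb2312"))
```

Because all noise has already been localized into one state, the repair step needs no knowledge of the probabilistic model; a variant could return the offending indices instead of deleting them when flagging is preferred.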

Though we have described the invention with respect to the GB 2312-80 Chinese 
character set and with only three states and localized all the noise into the ASCII state, 
the invention is generalizable to any character set having a states, where a > 1. Nor is the 
invention limited to character sets having one- and two-byte characters, but is 
generalizable to any combination of characters of any number of bytes. The fundamental 
procedure is to utilize a probabilistic model to localize all the error into one or more 
designated noise states. Also, a state that represents forbidden transitions in byte values 
may always be added to the model and may also be used as a designated noise state. 
Hence, the invention may easily be generalized to all multibyte text and will be found 
useful as a filter for all manner of character converters and text translators, editors, search 
engines, and the like. 

Further, it may be pointed out that more than one state may be designated as noise 
states and error assigned to each according to a set of rules. This is useful if one wishes 
to segregate different types of noise. Nor must the noise states correspond to actual valid 
states, but rather may be designated as separate individual states on their own. 

The method of the invention may be executed as a program of instructions 
tangibly embodied upon a storage device readable by machine, such as a computer. 

As an example of the workings of the invention, consider the following GB 
2312-80 byte sequence where each two-byte sequence is shown with the Chinese character it 
represents: 

Byte Sequence   181 231   202 211   189 218   196 191
Character       电        视        节        目

Now consider the case where the sequence is corrupted by a stray byte value, 
decimal 189, inserted at the beginning: 



Byte Sequence   189 181   231 202   211 189   218 196   191




Current technology is incapable of determining whether the above sequence is 
valid or not, but a native speaker of Chinese would instantly recognize the corrupted 
sequence as gibberish. According to the invention, the corrupted byte string is passed to 
the state sequence labeler 100 of Figure 1 where a corresponding state sequence is 
constructed according to Equations 1 through 5: 



Byte Sequence    189   181   231   202   211   189   218   196   191
State Sequence   A     GB1   GB2   GB1   GB2   GB1   GB2   GB1   GB2




The suspected byte sequence and the corresponding state sequence are passed to 
the repair module 110 where, assuming we have chosen to localize the noise to the ASCII 
state, the non-ASCII states are examined to test if they are within valid numeric ranges. 
Any found not to be are relabeled as ASCII. The ASCII states are then examined and 
any ASCII states not within acceptable byte values are considered invalid. All the 
non-ASCII states in our example are valid, however, so it only remains for the repair module 
110 to move to the last step and flag or delete the invalid ASCII characters: 



Byte Sequence   189             181 231   202 211   189 218   196 191
Character       (flag/delete)   电        视        节        目

thereby recovering the correct sequence. 

Of course, though we have depicted separate state sequence labelers 100 and 
repair modules 110, this is for illustrative and clarification purposes only. The 
functionality of the modules may be integrated together or, conversely, further subdivided 
as desired by the practitioner of the invention. 

It should be noted that the "states" of a byte sequence are ambiguous and 
therefore within the control of the user of the invention. The states selected for the 
method of the invention are dependent upon the noise one wishes to eliminate. In the 
above example, we used a three-state Markov model with the ASCII state designated as 
the noise state. This is sufficient where the only noise we are interested in is invalid 
bytes, but we could expand the Markov model to find other noise. For example, 
additional states for punctuation, consonants, vowels, capitalized letters, and the like may 
be added to monitor other forms of corruption. 

One may even use a Markov model that analyzes two or more bytes at a time, 
rather than one at a time. Hence the byte-from-state matrix B(X_i | S_i) would enlarge to 
256^2 × (number of states), with the definition of the states slightly modified. Such a model would 
pick out impermissible byte pairs (e.g., in an English version of the model, the letter "q" 
followed by anything other than a "u" could be picked out as noise, or any consonant not 
properly followed by a vowel). 

It is to be understood that all physical quantities disclosed herein, unless explicitly 
indicated otherwise, are not to be construed as exactly equal to the quantity disclosed, but 
rather about equal to the quantity disclosed. Further, the mere absence of a qualifier such 
as "about" or the like is not to be construed as an explicit indication that any such 
disclosed physical quantity is an exact quantity, irrespective of whether such qualifiers 
are used with respect to any other physical quantities disclosed herein. 

While preferred embodiments have been shown and described, various 
modifications and substitutions may be made thereto without departing from the spirit 
and scope of the invention. Accordingly, it is to be understood that the present invention 
has been described by way of illustration only, and such illustrations and embodiments as 
have been disclosed herein are not to be construed as limiting to the claims. 